Mark Dunning, Suraj Menon, and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts
Last modified: 14 Dec 2015
https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
http://www.revolutionanalytics.com/companies-using-r
Sidney Harris - New York Times
R)To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
2 + 2
## [1] 4
20/5 - sqrt(25) + 3^2
## [1] 8
sin(pi/2)
## [1] 1
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
<-x <- 10
x
## [1] 10
myNumber <- 25
myNumber
## [1] 25
sqrt(myNumber)
## [1] 5
x + myNumber
## [1] 35
x <- 21
x
## [1] 21
x <- myNumber
x
## [1] 25
myNumber <- myNumber + sqrt(16)
myNumber
## [1] 29
sin(x)
this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.
sum(3,4,5,6)
## [1] 18
max(3,4,5,6)
## [1] 6
min(3,4,5,6)
## [1] 3
seq(from = 2, to = 20, by = 4)
## [1] 2 6 10 14 18
seq(2, 20, 4)
## [1] 2 6 10 14 18
c combines its arguments into a vector:x <- c(3,4,5,6)
x
## [1] 3 4 5 6
[] indicate the position within the vector (the index). We can extract individual elements by using the [] notation:x[1]
## [1] 3
x[4]
## [1] 6
y <- c(2,3)
x[y]
## [1] 4 5
x <- c(3,4,5,6,7,8,9,10,11,12)
x <- 3:12
x
## [1] 3 4 5 6 7 8 9 10 11 12
seq() function, which returns a vector:x <- seq(2, 20, 4)
x
## [1] 2 6 10 14 18
x <- seq(2, 20, length.out=5)
x
## [1] 2.0 6.5 11.0 15.5 20.0
rep() function:y <- rep(3, 5)
y
## [1] 3 3 3 3 3
y <- rep(1:3, 5)
y
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
x <- 3:12
x[3:7]
## [1] 5 6 7 8 9
x[seq(2, 6, 2)]
## [1] 4 6 8
x[rep(3, 2)]
## [1] 5 5
y <- c(x, 1)
y
## [1] 3 4 5 6 7 8 9 10 11 12 1
z <- c(x, y)
z
## [1] 3 4 5 6 7 8 9 10 11 12 3 4 5 6 7 8 9 10 11 12 1
x <- 3:12
x[-3]
## [1] 3 4 6 7 8 9 10 11 12
x[-(5:7)]
## [1] 3 4 5 6 10 11 12
x[-seq(2, 6, 2)]
## [1] 3 5 7 9 10 11 12
x[6] <- 4
x
## [1] 3 4 5 6 7 4 9 10 11 12
x[3:5] <- 1
x
## [1] 3 4 1 1 1 4 9 10 11 12
Remember! - Square brackets for indexing [] - parentheses for function arguments ().
x <- 1:10
y <- x*2
y
## [1] 2 4 6 8 10 12 14 16 18 20
z <- x^2
z
## [1] 1 4 9 16 25 36 49 64 81 100
y + z
## [1] 3 8 15 24 35 48 63 80 99 120
x + 1:2
## [1] 2 4 4 6 6 8 8 10 10 12
x + 1:3
## Warning in x + 1:3: longer object length is not a
## multiple of shorter object length
## [1] 2 4 6 5 7 9 8 10 12 11
gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
names function, which can be useful to keep track of the meaning of our data:gene.expression <- c(0, 3.2, 1.2, -2)
gene.expression
## [1] 0.0 3.2 1.2 -2.0
names(gene.expression) <- gene.names
gene.expression
## Pax6 Beta-actin FoxP2 Hox9
## 0.0 3.2 1.2 -2.0
names function to get a vector of the names of an object:names(gene.expression)
## [1] "Pax6" "Beta-actin" "FoxP2"
## [4] "Hox9"
| Species | Genome size (Mb) | Protein coding genes |
|---|---|---|
| Homo sapiens | 3,102 | 20,774 |
| Mus musculus | 2,731 | 23,139 |
| Drosophila melanogaster | 169 | 13,937 |
| Caenorhabditis elegans | 100 | 20,532 |
| Saccharomyces cerevisiae | 12 | 6,692 |
c function. Create a species.name vector and use this vector to name the values in the other two vectors.coding.bases.coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)genome.size <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
"D. melanogaster","C. elegans",
"S. cerevisiae")
names(genome.size) <- species.name
names(coding.genes) <- species.name
coding.bases <- coding.genes*0.0015
coding.bases
## H. sapiens M. musculus D. melanogaster
## 31.1610 34.7085 20.9055
## C. elegans S. cerevisiae
## 30.7980 10.0380
coding.pc <- coding.bases/genome.size*100
coding.pc
## H. sapiens M. musculus D. melanogaster
## 1.004545 1.270908 12.370118
## C. elegans S. cerevisiae
## 30.798000 83.650000
coding.bases[1]/coding.bases[5]
## H. sapiens
## 3.104304
genome.size[1]/genome.size[5]
## H. sapiens
## 258.5
coding.pc) but sometimes it is not (when we are comparing human to yeast). We can remove names by setting them to the special NULL value:names(coding.pc) <- NULL
coding.pc
## [1] 1.004545 1.270908 12.370118 30.798000
## [5] 83.650000
Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
solution-exercise1.Rmd for solution to Exercise 1? followed by the function name. For example:?seq
example function:example(seq)
?? followed by your guess. R will return a list of possibilities:??plot
; end of line (Enables multiple commands to be placed on one line of text)# comment (indicates text is a comment and not executed)+ command line wrap (R is waiting for you to complete an expression)sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats packageinstall.packages
install.packages(name.of.my.package)
source("http://bioconductor.org/biocLite.R")
biocLite() function:biocLite("PackageName")
install.packages() function to install it:install.packages("ggplot2")
DESeq is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq datasource("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
library(...) function to load the newly installed features:library(ggplot2) # loads ggplot functions
library(DESeq) # loads DESeq functions
library() # Lists all the packages you've got installed
| First_Name | Second_Name | Full_Name | Sex | Age | Weight | Consent | |
|---|---|---|---|---|---|---|---|
| 1 | Adam | Jones | Adam Jones | Male | 50 | 70.8 | TRUE |
| 2 | Eve | Parker | Eve Parker | Female | 21 | 67.9 | TRUE |
| 3 | John | Evans | John Evans | Male | 35 | 75.3 | FALSE |
| 4 | Mary | Davis | Mary Davis | Female | 45 | 61.9 | TRUE |
| 5 | Peter | Baker | Peter Baker | Male | 28 | 72.4 | FALSE |
| 6 | Paul | Daniels | Paul Daniels | Male | 31 | 69.9 | FALSE |
| 7 | Joanna | Edwards | Joanna Edwards | Female | 42 | 63.5 | FALSE |
| 8 | Matthew | Smith | Matthew Smith | Male | 33 | 71.5 | TRUE |
| 9 | David | Roberts | David Roberts | Male | 57 | 73.2 | FALSE |
| 10 | Sally | Wilson | Sally Wilson | Female | 62 | 64.8 | TRUE |
age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5,
71.5, 73.2, 64.8)
firstName <- c("Adam", "Eve", "John", "Mary", "Peter",
"Paul", "Joanna", "Matthew", "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
"Baker","Daniels", "Edwards", "Smith", "Roberts", "Wilson")
TRUE and FALSE:consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, TRUE)
c(20, "a string", TRUE)
## [1] "20" "a string" "TRUE"
class function: class(firstName)
## [1] "character"
class(age)
## [1] "numeric"
class(weight)
## [1] "numeric"
class(consent)
## [1] "logical"
sex <- c("Male", "Female", "Male", "Female", "Male", "Male",
"Female", "Male", "Male", "Female")
sex
## [1] "Male" "Female" "Male" "Female" "Male" "Male"
## [7] "Female" "Male" "Male" "Female"
factor(sex)
## [1] Male Female Male Female Male Male Female Male
## [9] Male Female
## Levels: Female Male
paste function joins character vectors together)patients <- data.frame(firstName, secondName,
paste(firstName, secondName),
sex, age, weight, consent)
patients
## firstName secondName paste.firstName..secondName. sex age weight consent
## 1 Adam Jones Adam Jones Male 50 70.8 TRUE
## 2 Eve Parker Eve Parker Female 21 67.9 TRUE
## 3 John Evans John Evans Male 35 75.3 FALSE
## 4 Mary Davis Mary Davis Female 45 61.9 TRUE
## 5 Peter Baker Peter Baker Male 28 72.4 FALSE
## 6 Paul Daniels Paul Daniels Male 31 69.9 FALSE
## 7 Joanna Edwards Joanna Edwards Female 42 63.5 FALSE
## 8 Matthew Smith Matthew Smith Male 33 71.5 TRUE
## 9 David Roberts David Roberts Male 57 73.2 FALSE
## 10 Sally Wilson Sally Wilson Female 62 64.8 TRUE
$’ operator:patients$age
## [1] 50 21 35 45 28 31 42 33 57 62
paste command)names function, and we can use the same function to see the names:names(patients) <- c("First_Name", "Second_Name",
"Full_Name", "Sex", "Age", "Weight", "Consent")
names(patients)
## [1] "First_Name" "Second_Name"
## [3] "Full_Name" "Sex"
## [5] "Age" "Weight"
## [7] "Consent"
patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName,secondName),
Sex = sex,
Age = age,
Weight = weight,
Consent = consent)
names(patients)
## [1] "First_Name" "Second_Name"
## [3] "Full_Name" "Sex"
## [5] "Age" "Weight"
## [7] "Consent"
patients$First_Name
## [1] Adam Eve John Mary
## [5] Peter Paul Joanna Matthew
## [9] David Sally
## 10 Levels: Adam David Eve ... Sally
factor:patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName, secondName),
Sex = factor(sex),
Age = age,
Weight = weight,
Consent = consent,
stringsAsFactors = FALSE)
patients$Sex
## [1] Male Female Male Female Male
## [6] Male Female Male Male Female
## Levels: Female Male
patients$First_Name
## [1] "Adam" "Eve" "John"
## [4] "Mary" "Peter" "Paul"
## [7] "Joanna" "Matthew" "David"
## [10] "Sally"
all columns are assumed to contain the same data type. e.g. numerical
e <- matrix(1:10, nrow=5, ncol=2)
e
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
rowMeans(e)
## [1] 3.5 4.5 5.5 6.5 7.5
object[rows, colums]
e[1,2]
## [1] 6
e[1,]
## [1] 1 6
patients[1,2]
## [1] "Jones"
patients[1,]
## First_Name Second_Name Full_Name
## 1 Adam Jones Adam Jones
## Sex Age Weight Consent
## 1 Male 50 70.8 TRUE
s <- letters[1:5]
s[c(1,3)]
## [1] "a" "c"
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] "a" "c"
TRUE and FALSE values to subset the vectora <- 1:5
a < 3
## [1] TRUE TRUE FALSE FALSE FALSE
s[a < 3]
## [1] "a" "b"
<, >, <=, >=, ==, !=!, &, |, xor
TRUE, FALSE)s
## [1] "a" "b" "c" "d" "e"
a
## [1] 1 2 3 4 5
s[a > 1 & a <3]
## [1] "b"
s[a == 2]
## [1] "b"
country, continent, and height
country a character vector but continent a factorsummary function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)patients[patients$Age < 40,]
patients[patients$Consent == TRUE,]
patients[patients$Sex == "Male" & patients$Weight >= 70.8,]
read.table()read.csv(), read.delim()write.table()write.csv()setwd("/home/participant/Course_Materials")
Before we even start the analysis, we need to be sure of where the data are located on our hard drive
getwd()
list.files()
file.chooserawdata)rawData <- read.delim("countData.txt")
file.choose()myfile <- file.choose()
rawData <- read.delim(myfile)
sep="," or use read.csv:read.csv("countData.csv")
?read.table
Always check the object to make sure the contents and dimensions are as you expect
rawData[1:10,] # View the first 10 rows to ensure import is OK
## Patient Nuclei NB_Amp NB_Nor NB_Del
## 1 1 65 0 63 2
## 2 2 51 3 43 5
## 3 3 37 2 35 0
## 4 4 37 2 35 0
## 5 5 45 2 42 1
## 6 6 46 4 41 1
## 7 7 65 1 64 0
## 8 8 59 1 54 4
## 9 9 49 0 48 1
## 10 10 46 0 45 1
View function to get a display of the data in RStudioView(rawData)
Once we have read the data successfully, we can start to interact with it
class(rawData)
## [1] "data.frame"
ncol(rawData)
## [1] 5
nrow(rawData)
## [1] 50
dim(rawData)
## [1] 50 5
colnames(rawData)
## [1] "Patient" "Nuclei" "NB_Amp"
## [4] "NB_Nor" "NB_Del"
rawData$Nuclei
## [1] 65 51 37 37 45 46 65 59 49 46 38 46
## [13] 34 59 52 66 68 49 62 64 41 53 38 35
## [25] 45 41 49 53 47 38 61 52 59 69 59 33
## [37] 44 53 62 63 70 30 39 34 60 61 70 58
## [49] 39 61
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)
You will make your life a lot easier if you keep your data tidy
http://vita.had.co.nz/papers/tidy-data.pdf
and organised
NA values, which means the values are missing – a common occurrence in real data collectionNA is a special value that can be present in objects of any type (logical, character, numeric etc)NA is not the same as NULL.
NULL is an empty R object.NA is one missing value within an R object (like a data frame or a vector)NAs gracefullyx <- c(1,NA,3)
length(x)
## [1] 3
NAs, and functions often have their own arguments (like na.rm) for handling themmean(x, na.rm=TRUE)
## [1] 2
mean(na.omit(x))
## [1] 2
which function to select indices from a logical vector that are TRUE# create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei
prop > 0.33
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [7] FALSE FALSE FALSE FALSE TRUE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE
## [19] FALSE FALSE FALSE FALSE TRUE FALSE
## [25] NA FALSE NA FALSE FALSE FALSE
## [31] FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE
## [43] NA FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE
# Get sample names of amplified patients:
amp <- which(prop > 0.33)
amp
## [1] 11 23
plot(prop, ylim=c(0,1))
abline(h=0.33) # Add a horizonal line
write.csv(rawData[amp,], file="selectedSamples.csv")
getwd() # print working directory
list.files() # list files in working directory
(NB_Amp / Nuclei < 0.33 & NB_Del == 0)norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
## [1] 3 4 7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")
lattice, ggplot2plot will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
patients$Weight
## [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5
## [8] 71.5 73.2 64.8
plot(patients$Weight)
R tries to guess the most appropriate way to visualise the data
plot. It will put the values from the first argument in the x axis, and values from the second argument on the y axis
patients$Age
## [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Age,patients$Weight)
plot functionbarplotbarplot(patients$Age)
barplot(summary(patients$Sex))
hist(patients$Weight)
boxplot(patients$Weight)
y ~ x means put continuous variable y on the y axis and categorical x on the x axisboxplot(patients$Weight~patients$Sex)
example(dotchart)example(stripchart)example(vioplot) ### From vioplot libraryexample(beeswarm) ## From beeswarm libraryozone.csv
weather <- read.csv("ozone.csv")
View(weather)
| Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|
| 41 | 190 | 7.4 | 67 | 5 | 1 |
| 36 | 118 | 8.0 | 72 | 5 | 2 |
| 12 | 149 | 12.6 | 74 | 5 | 3 |
| 18 | 313 | 11.5 | 62 | 5 | 4 |
| NA | NA | 14.3 | 56 | 5 | 5 |
| 28 | NA | 14.9 | 66 | 5 | 6 |
plot(weather$Solar.R,weather$Ozone)
hist(weather$Temp)
boxplot(weather$Ozone)
boxplot(weather$Ozone~weather$Month)
plot comes with a whole array of arguments that can be set when we call the function
?plot and ?parplot(patients$Weight,type="l")
plot(patients$Weight,type="b")
main argumentplot(patients$Age,patients$Weight,main="Relationship between Weight and Age")
plot(patients$Age,patients$Weight,xlab="Weight")
plot(patients$Age,patients$Weight,ylab="Age")
ylim and xlim are used to specify axis limitsplot(patients$Age,patients$Weight,
ylab="Age",
xlab="Weight",
main="Relationship between Weight and Age",
xlim=c(10,70),
ylim=c(60,80))
"red", "orange","green","blue","yellow"….colours().Changing the col argument to plot changes the colour that the points are plotted in
plot(patients$Age,patients$Weight,col="red")
plot(patients$Age,patients$Weight, pch=16)
plot(patients$Age,patients$Weight, pch="X")
Character expansion
plot(patients$Age,patients$Weight, cex=2)
Character expansion
plot(patients$Age,patients$Weight, cex=0.2)
plot(patients$Age,patients$Weight,
pch = 1:10,
col=c("red","orange","green","blue"),
cex = 1:5)
plot
boxplot(patients$Weight~patients$Sex,
xlab="Sex",
ylab="Weight",
main="Relationship between Weight and Gender",
col=c("blue","yellow"))
breaks and freq arguments to hist (?hist) to create 50 bins and display density rather than frequency?rainbow)plot(weather$Solar.R,weather$Ozone,col="orange",pch=16,
ylab="Ozone level",xlab="Solar Radiation",
main="Relationship between ozone level and solar radiation")
hist(weather$Temp,col="purple",xlab="Temperature",
main="Distribution of Temperature",breaks = 50:100,freq=FALSE)
rainbow function is used to create a vector of colours for the boxplot; in other words a palette
boxplot(weather$Ozone~weather$Month,col=rainbow(5))
RColorBrewer packagelibrary(RColorBrewer)
display.brewer.all()
boxplot(weather$Ozone~weather$Month,col=brewer.pal(5,"Set1"))
library(palettetown)
boxplot(weather$Ozone~weather$Month,col=pokepal("Pikachu",5))